Visualizing clusters in Python

I am wanting to see the results of clustering methods such as K-Means; this is my playground.

Initial examples are taken from K Means Clustering in Python


In [ ]:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
import sklearn.metrics as sm

import pandas as pd
import numpy as np

# Only needed if you want to display your plots inline if using Notebook
# change inline to auto if you have Spyder installed
%matplotlib inline

Using Iris Dataset


In [ ]:
# import some data to play with
iris = datasets.load_iris()
# look at individual aspects by uncommenting the below
#iris.data
#iris.feature_names
#iris.target
#iris.target_names

Original author converted the data to Pandas Dataframes. Note that we have separated out the inputs (x) and the outputs/labels (y).


In [ ]:
# Store the inputs as a Pandas Dataframe and set the column names
x = pd.DataFrame(iris.data)
x.columns = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width']

y = pd.DataFrame(iris.target)
y.columns = ['Targets']

Visualise the data

It is always important to have a look at the data. We will do this by plotting two scatter plots. One looking at the Sepal values and another looking at Petal. We will also set it to use some colours so it is clearer.


In [ ]:
# Set the size of the plot
plt.figure(figsize=(14,7))

# Create a colormap
colormap = np.array(['red', 'lime', 'black'])

# Plot Sepal
plt.subplot(1, 2, 1)
plt.scatter(x.Sepal_Length, x.Sepal_Width, c=colormap[y.Targets], s=40)
plt.title('Sepal')

plt.subplot(1, 2, 2)
plt.scatter(x.Petal_Length, x.Petal_Width, c=colormap[y.Targets], s=40)
plt.title('Petal');

Build the K Means Model - non-Spark example

This is the easy part, providing you have the data in the correct format (which we do). Here we only need two lines. First we create the model and specify the number of clusters the model should find (n_clusters=3) next we fit the model to the data.


In [ ]:
# K Means Cluster
model = KMeans(n_clusters=3)
model.fit(x)
1
2
3
# K Means Cluster
model = KMeans(n_clusters=3)
model.fit(x)

# This is what KMeans thought
model.labels_

Visualise the classifier results

Let's plot the actual classes against the predicted classes from the K Means model.

Here we are plotting the Petal Length and Width, however each plot changes the colors of the points using either c=colormap[y.Targets] for the original class and c=colormap[model.labels_] for the predicted classess.


In [ ]:
# View the results
# Set the size of the plot
plt.figure(figsize=(14,7))

# Create a colormap
colormap = np.array(['red', 'lime', 'black'])

# Plot the Original Classifications
plt.subplot(1, 2, 1)
plt.scatter(x.Petal_Length, x.Petal_Width, c=colormap[y.Targets], s=40)
plt.title('Real Classification')

# Plot the Models Classifications
plt.subplot(1, 2, 2)
plt.scatter(x.Petal_Length, x.Petal_Width, c=colormap[model.labels_], s=40)
plt.title('K Mean Classification');

Fixing the coloring

Here we are going to change the class labels, we are not changing the any of the classification groups we are simply giving each group the correct number. We need to do this for measuring the performance.

Using this code below we using the np.choose() to assign new values, basically we are changing the 1’s in the predicted values to 0’s and the 0’s to 1’s. Class 2 matched so we can leave. By running the two print functions you can see that all we have done is swap the values.

NOTE: your results might be different to mine, if so you will have to figure out which class matches which and adjust the order of the values in the np.choose() function.


In [ ]:
# The fix, we convert all the 1s to 0s and 0s to 1s.
predY = np.choose(model.labels_, [1, 0, 2]).astype(np.int64)
print (model.labels_)
print (predY)

Re-plot

Now we can re plot the data as before but using predY instead of model.labels_.


In [ ]:
# View the results
# Set the size of the plot
plt.figure(figsize=(14,7))

# Create a colormap
colormap = np.array(['red', 'lime', 'black'])

# Plot Orginal
plt.subplot(1, 2, 1)
plt.scatter(x.Petal_Length, x.Petal_Width, c=colormap[y.Targets], s=40)
plt.title('Real Classification')

# Plot Predicted with corrected values
plt.subplot(1, 2, 2)
plt.scatter(x.Petal_Length, x.Petal_Width, c=colormap[predY], s=40)
plt.title('K Mean Classification');